Decision Trees

Hector Corrada-Bravo and Rafael A. Irizarry
May 1, 2016

Decision Trees

We have described two types of ML algorithms:

  • Linear: linear regression, GLM, LDA, QDA
  • Smoothing: bin smoother, loess, kNN

The first lacks flexibility.

The second does not do well with the curse of dimensionality.

The Curse of Dimensionality

plot of chunk unnamed-chunk-2

The Curse of Dimensionality

plot of chunk unnamed-chunk-3

Or, if we want to include 10% of the data in two dimensions, we need to increase the window size so each side covers \( \sqrt{.10} \):

The Curse of Dimensionality

plot of chunk unnamed-chunk-4

The Curse of Dimensionality

To include 10% of the data, each dimension needs to cover \( .10^{1/p} \):

plot of chunk unnamed-chunk-5
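A quick base-R calculation (a sketch of ours, not from the original slides) makes the formula concrete: the per-dimension edge length \( .10^{1/p} \) approaches 1 as \( p \) grows.

```r
# Side length a hypercube "window" must have to capture a given
# fraction of uniformly distributed data in p dimensions.
edge_length <- function(p, fraction = 0.10) fraction^(1 / p)

round(edge_length(c(1, 2, 5, 10, 100)), 3)
# For p = 100 the window must span nearly 98% of each dimension,
# so a "local" neighborhood is not local at all.
```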

Decision Trees

plot of chunk unnamed-chunk-6

Regression Trees

plot of chunk unnamed-chunk-7

Regression Trees

plot of chunk unnamed-chunk-8

Decision Trees Are Intuitive

Regression Trees

We use Regression Trees for continuous outcomes:

  1. Partition the predictor space into \( J \) non-overlapping regions, \( R_1, R_2, \ldots, R_J \).
  2. For every observation that falls within region \( R_j \), predict the response as the mean of the responses for training observations in \( R_j \).

The important observation is that Regression Trees create the partition recursively.

Olives

plot of chunk unnamed-chunk-9

Regression Trees

A recursive algorithm would look like this:

  1. Find the predictor \( j \) and value \( s \) that minimize the RSS:

\[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]

where \( R_1 \) and \( R_2 \) result from splitting predictor \( j \) at \( s \):

\[ R_1(j,s) = \{X|X_j < s\} \mbox{ and } R_2(j,s) = \{X|X_j \geq s\} \]
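The split search in step 1 can be sketched in base R; the helper name `best_split` is ours, not part of any package:

```r
# Exhaustive search for the (predictor, value) pair minimizing RSS.
# x: numeric matrix of predictors; y: numeric response vector.
best_split <- function(x, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(x))) {
    for (s in unique(x[, j])) {
      left  <- y[x[, j] <  s]   # responses in R_1(j, s)
      right <- y[x[, j] >= s]   # responses in R_2(j, s)
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left  - mean(left))^2) +
             sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best
}

# Toy example: one predictor with an obvious break between 0.3 and 0.7
x <- matrix(c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9), ncol = 1)
y <- c(1, 1, 1, 5, 5, 5)
best_split(x, y)  # finds the split s = 0.7 with RSS = 0
```

A real implementation only needs to check the sorted unique values of each predictor, but the quadratic search above is the clearest statement of the algorithm.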

Example

plot of chunk unnamed-chunk-10

Example

plot of chunk unnamed-chunk-11

Example

plot of chunk unnamed-chunk-12

Specifics of the regression tree algorithm

When do we stop partitioning? We stop when:

  1. adding a partition does not reduce the RSS, or
  2. a partition would have too few training observations.

Even then, trees built with this stopping criterion tend to overfit the training data.

To avoid this, a post-processing step called pruning is used to make the tree smaller.

Pruning

library(tree)

# Grow a deep tree: allow single-observation children and very
# small deviance reductions, then choose a size by cross-validation.
tree_control <- tree.control(
  nobs = nrow(polls_2008),
  mincut = 1,     # minimum observations in a child node
  minsize = 2,    # minimum node size to attempt a split
  mindev = 0.001) # minimum deviance reduction required to split

fit <- tree(Y ~ X, data = polls_2008,
            control = tree_control)
cv_polls <- cv.tree(fit)  # cross-validated deviance by tree size

Pruning

plot of chunk unnamed-chunk-14

Classification (Decision) Trees

Score functions used to choose predictors and split values for partitions:

  • Gini Index: \( \sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk}) \), or
  • Entropy: \( -\sum_{k=1}^K \hat{p}_{mk}\log(\hat{p}_{mk}) \)

where \( \hat{p}_{mk} \) is the proportion of training observations in partition \( m \) labeled as class \( k \). Both of these seek to partition observations into subsets that are as homogeneous in their labels as possible.
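Both scores are minimized at 0 for a pure node and maximized when classes are evenly mixed. A base-R sketch (the function names are ours):

```r
# Gini index and entropy for a vector of class proportions p-hat_{mk}.
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # 0*log(0) = 0

pure  <- c(1, 0)      # all observations in one class
mixed <- c(0.5, 0.5)  # evenly split between two classes

gini(pure);    gini(mixed)     # 0 vs 0.5
entropy(pure); entropy(mixed)  # 0 vs log(2)
```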

Example

plot of chunk unnamed-chunk-15

Tree

plot of chunk unnamed-chunk-16

Estimated Conditional Probabilities

plot of chunk unnamed-chunk-17

Prune

# Cost-complexity pruning: plot deviance against pruned tree size
pruned_fit <- prune.tree(fit)
plot(pruned_fit)

plot of chunk unnamed-chunk-18

plot of chunk unnamed-chunk-19

Advantages & Limitations

Advantages

  • Highly interpretable.
  • Easy to visualize.
  • Model human decision processes.

Disadvantages

  • Hard to train well due to the greedy fitting approach.
  • Not very flexible.
  • The estimated conditional expectation is a step function.

Bootstrap

We use an income distribution as an example:

plot of chunk unnamed-chunk-21

We are Interested in Median

The population median is

m <- median(income)
m
[1] 45369.52

Estimate from a Sample

set.seed(1)
N <- 250
X <- sample(income, N)
M <- median(X)
M
[1] 44334.42

How Do We Get a Confidence Interval?

Monte Carlo:

B <- 10^5
Ms <- replicate(B, {
  X <- sample(income, N)
  M <- median(X)
})

plot of chunk unnamed-chunk-25

We Don't Have Access to Population

The idea behind the bootstrap is to use the sample as an estimate of the population: we sample from the sample.

B <- 10^5
M_stars <- replicate(B, {
  X_star <- sample(X, N, replace = TRUE)
  M_star <- median(X_star)
})

Distributions are Similar

qqplot(Ms, M_stars)
abline(0,1)  

plot of chunk unnamed-chunk-28

Confidence Intervals

True CI: 39765.15 51721.77
Bootstrap estimate: 39549.38 50278.73

This is much better than what we get if we mindlessly use the CLT:

[1] 27583.05 61085.80
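The bootstrap interval shown above is a percentile interval: take quantiles of the bootstrap medians. A sketch with simulated data (the `income` vector itself is not reproduced here, so a right-skewed stand-in is used):

```r
set.seed(1)
x <- rexp(250, rate = 1 / 45000)  # stand-in for a right-skewed income sample
m_stars <- replicate(10^4, median(sample(x, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the median
quantile(m_stars, c(0.025, 0.975))
```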

If Normal

If we know the sampling distribution is approximately normal, we can instead use its estimated mean and standard error to form a confidence interval:

mean(Ms) + 1.96*sd(Ms)*c(-1,1)
[1] 38346.74 52635.97
mean(M_stars) + 1.96*sd(M_stars)*c(-1,1)
[1] 37741.55 51923.76

Random Forests

General idea: average multiple decision trees. We construct the trees with randomness so that they differ from one another.

The first idea is Bagging (bootstrap aggregation)

  1. Build many decision trees \( T_1, T_2, \ldots, T_B \) from training set
  2. Given a new observation, let each \( T_j \) predict \( \hat{y}_j \)
  3. For regression: predict average \( \frac{1}{B} \sum_{j=1}^B \hat{y}_j \), for classification: predict with majority vote (most frequent class)

Use Bootstrap to Generate Different Trees

To create \( T_j, \, j=1,\ldots,B \) from training set of size \( N \):

  1. create a bootstrap training set by sampling \( N \) observations from training set with replacement

  2. build a decision tree from bootstrap training set
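The two steps above can be sketched in base R with one-split "stump" trees as the base learner (all function names here are ours, and a real forest grows much deeper trees):

```r
# Fit a one-split regression stump by exhaustive RSS search.
fit_stump <- function(x, y) {
  best <- list(rss = Inf, s = NA, left = NA, right = NA)
  for (s in unique(x)) {
    l <- y[x < s]; r <- y[x >= s]
    if (length(l) == 0 || length(r) == 0) next
    rss <- sum((l - mean(l))^2) + sum((r - mean(r))^2)
    if (rss < best$rss)
      best <- list(rss = rss, s = s, left = mean(l), right = mean(r))
  }
  best
}
predict_stump <- function(stump, x)
  ifelse(x < stump$s, stump$left, stump$right)

# Bagging: fit a stump to each bootstrap sample, average predictions.
bagged_predict <- function(x, y, x_new, B = 50) {
  preds <- replicate(B, {
    idx <- sample(length(y), replace = TRUE)  # step 1: bootstrap sample
    predict_stump(fit_stump(x[idx], y[idx]), x_new)  # step 2: fit tree
  })
  rowMeans(preds)  # regression: average the B predictions
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 0, 1) + rnorm(100, sd = 0.1)
bagged_predict(x, y, c(0.1, 0.9))  # close to c(0, 1)
```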

Example

plot of chunk unnamed-chunk-32


Predictor Subsetting

The second Random Forests feature is to use a random selection of predictors when deciding each split.

When building each tree \( T_j \), at each recursive partition only consider a randomly selected subset of predictors to check for best split.

This reduces correlation between trees in forest, improving prediction accuracy.
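In R terms the subsetting amounts to the following (a sketch; the pixel count of 784 for 28-by-28 digit images is an illustrative assumption):

```r
# At each split a random forest considers only mtry of the p predictors;
# floor(sqrt(p)) is the common default for classification.
p <- 784                        # e.g., 28 x 28 pixel predictors for digits
mtry <- floor(sqrt(p))          # 28 candidate predictors per split
candidates <- sample(p, mtry)   # drawn anew at every partition
candidates
```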

Example

plot of chunk unnamed-chunk-34

Tuning Parameters

library(randomForest)

# nodesize sets the minimum size of terminal nodes; larger values
# yield smoother decision boundaries.
fit <- randomForest(
  as.factor(label) ~ X_1 + X_2,
  nodesize = 250,
  data = digits_train)

plot of chunk unnamed-chunk-36

Compare Results

CART:  0.8055037
First RF:  0.801306
Second RF:  0.8264925

Digits Full Example

pred <- predict(fit, newdata = test_set, type = "class")
confusionMatrix(table(pred = pred, true = test_set$label))$overall[1]
 Accuracy 
0.9659199 

Variable Importance

feature Gini
pixel378 315.1068
pixel437 308.2002
pixel434 291.9889
pixel405 284.2842
pixel290 252.8761
pixel409 247.7304
pixel350 245.8441
pixel377 228.2036
pixel515 227.6764
pixel542 226.3225

Variable Importance

plot of chunk unnamed-chunk-41

Variable Importance

plot of chunk unnamed-chunk-42